Engineering a Distributed Full-Text Index
Authors
Abstract
We present a distributed full-text index for big data applications. The index answers several types of pattern matching queries (existential, counting, and enumeration) and can be extended to answer document retrieval queries (counting, retrieval, and top-k). We also show that succinct data structures are useful for big data applications: their low memory consumption allows us to build indices for larger slices of text in main memory.
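To make the three pattern matching query types concrete, the following minimal sketch answers existential, counting, and enumeration queries over a single in-memory string. It uses a plain, naively built suffix array as a stand-in; it is not the succinct or distributed structure described in the abstract, and all function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(text):
    # Naive construction by sorting all suffixes; for illustration only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def _match_range(text, sa, pattern):
    # Suffixes that start with `pattern` occupy one contiguous range of the
    # suffix array. Truncating the sorted suffixes to len(pattern) keeps them
    # in non-decreasing order, so binary search applies.
    # Assumption: rebuilding this list on every query is acceptable for a
    # sketch; a real index would compare against the text in place.
    m = len(pattern)
    prefixes = [text[i:i + m] for i in sa]
    return bisect_left(prefixes, pattern), bisect_right(prefixes, pattern)


def exists(text, sa, pattern):
    # Existential query: does the pattern occur at least once?
    lo, hi = _match_range(text, sa, pattern)
    return hi > lo


def count(text, sa, pattern):
    # Counting query: how many times does the pattern occur?
    lo, hi = _match_range(text, sa, pattern)
    return hi - lo


def enumerate_occurrences(text, sa, pattern):
    # Enumeration query: the starting positions of all occurrences.
    lo, hi = _match_range(text, sa, pattern)
    return sorted(sa[lo:hi])


text = "abracadabra"
sa = build_suffix_array(text)
print(exists(text, sa, "cad"))
print(count(text, sa, "abra"))
print(enumerate_occurrences(text, sa, "abra"))
```

All three queries share the same range lookup and differ only in what they report about it, which is why a single index can serve every query type.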
Similar resources
Clustered Distributed Index for Efficient Text Retrieval Using Threads
This paper presents a novel method for improving clustered distributed indices for efficient text retrieval using threads. In text retrieval, text search refers to the technique of searching stored documents or a database. In a full-text search, the search engine examines all the words in every stored document as it tries to match the search words supplied by the user. When dealing w...
Clustering Full Text Documents
An index or topic hierarchy of full-text documents can organize a domain and speed information retrieval. Traditional indexes, like the Library of Congress system or Dewey Decimal system, are generated by hand, updated infrequently, and applied inconsistently. With machine learning, they can be generated automatically, updated as new documents arrive, and applied consistently. Despite the appea...
Using Google Scholar to Search for Online Availability of a Cited Article in Engineering Disciplines
Many published studies examine the effectiveness of Google Scholar (Scholar) as an index for scholarly articles. This paper analyzes the value of Scholar in finding and labeling online full text of articles using titles from the citations of engineering faculty publications. For the fields of engineering and the engineering colleges in the study, Scholar identified online access for 25% of the ...
A Two-Tier Distributed Full-Text Indexing System
The performance of the indexing system is very important for a search engine. Indexing systems on large-scale clusters usually provide high search efficiency, but they incur high hardware costs. These costs would be greatly reduced if a distributed indexing system ran on small-scale clusters connected by the Internet. Two current inverted file partitioning schemes: document partitioning an...
A Distributed Digital Library Architecture Incorporating Different Index Styles
The New Zealand Digital Library offers several collections of information over the World Wide Web. Although full-text indexing is the primary access mechanism, musical collections can also be accessed through a novel melody retrieval system. In offering this service over a three-year period, we have had to face many practical challenges in building, maintaining, and administering diverse collec...